Why would a robot make use of pronouns? An evolutionary investigation of the emergence of pronominal anaphora

Dan Cristea 1,2, Emanuel Dima 1, Corina Dima 1
1 Alexandru Ioan Cuza University of Iasi, Faculty of Computer Science, 16, Berthelot St., 700486 Iasi, Romania
2 Romanian Academy, Institute of Computer Science, the Iasi branch
{dcristea, dimae, cdima}@info.uaic.ro

Abstract. In this paper we investigate whether, and under what conditions, pronominal anaphora could be acquired by intelligent agents as a means of referring to recently mentioned entities. The use of pronouns is conditioned by the existence of a memory that records the object previously in focus. The approach follows an evolutionary paradigm of language acquisition. Experiments show that pronouns can easily be included in the vocabulary of a community of 10 agents dialoguing on a static scene and that, generally, they enhance communication success.

Keywords: anaphora, simulation, language emergence, artificial agents, language games

1 Introduction

In this paper we study the acquisition of pronominal anaphora by intelligent agents in locally situated communication. The research represents a step in the attempt to decipher the acquisition of language in communities of humans. Recently, the historical interest in the origins of language seems to have been reopened. Progress in this direction comes from approaches to language evolution, especially from experiments on lexicon acquisition and grammar acquisition. Models of language acquisition hypothesise that language users gradually build their language skills in order to optimise their communicative success and expressiveness, as triggered by the need to raise communication success and to reduce the cognitive effort needed for semantic interpretation.

The Talking Heads experiments have already shown that a shared lexicon can be developed inside a community of agents that are motivated to communicate. The participants in the experiments are intelligent
humanoid robots, able to move, see and interpret the reality around them (very simple scenes of objects), as well as to point to specific objects. They are programmed to play language guessing games in which a speaker and a hearer, members of a community of agents, should acquire a common understanding of a situation that is visually shared by both participants in the dialogue. After tens of thousands of such games, played in pairs by members of the community, a vocabulary spontaneously arises that gives names to the concepts needed to differentiate the properties of objects. The vocabulary is shared by the majority of agents and is relatively stable under perturbing influences caused by population growth, decline, or mixing with other, smaller groups. It was shown that using multi-word expressions instead of single words could reduce the size of the lexicon, therefore yielding a more efficient communication system.

The next step deals with the acquisition of grammar. Recent work showed how rudiments of grammar can be developed as a result of interactions. The aspects studied include the capacity of agents to invent grammatical markers for indicating event structures, the formation of semantic roles, and the combination of markers into larger argument structure constructions through pattern formation. A formalism used to model grammaticality in the evolutionary approach is Fluid Construction Grammar.

In the present paper we are interested in seeing whether the acquisition of pronominal anaphora inside a community of intelligent agents can be empirically modelled by following an evolutionary approach and, if so, in pointing out the minimal cognitive requirements that allow the use of pronominal anaphors, how many interactions are necessary for a pronoun to appear in the vocabulary of a community, and what the communication gain is when pronouns are used.

The world is extremely complex, and an attempt to bring it into the laboratory in order to model language acquisition in a natural
setting is unrealistic. Languages have evolved to manage the complexity of the world around us, and it is clearly an error to consider that humans first gained a sophisticated cognitive apparatus and only afterwards started to invent language. It is known that the evolution of the human brain is closely correlated with the acquisition of language. We are, therefore, forced to simplify the model we employ.

A few simple rules of a linguistic game are explained in section 2. In section 3 we ground the use of pronouns on a minimal cognitive infrastructure. In section 4, some scene settings displaying increasing difficulty of comprehension are introduced. The dialogue experiments are conducted in these settings and the results are presented in section 5. Finally, section 6 summarises some conclusions.

2 The Experiments Framework

We organised a number of game-based experiments, during which the agents were expected to achieve rudiments of discourse-level performance. For that purpose, a Java framework called Uruk has been developed, in which games can be easily defined and which allows for any number of experiments. By properly manipulating a number of parameters, Uruk can be made to support the description of the closed worlds, of the agents' cognitive capacities and their lexical memories, as well as of their dialogues.

We consider a world as being composed of objects with properties (shape, colour, position, etc.), and a scene is a world with a specified configuration of objects. The framework offers two ways to generate a scene: by manually describing all the objects populating it as well as all their properties, or by generating it randomly. The number of objects generated in a random scene can be set within specified lower and upper bounds.

We do not speak about software agents stricto sensu; in our framework they are not mobile, are not autonomous, and do not react to changes in their environment. However, based on a learning process, they will come to possess
rudiments of language. The language is acquired through interactions in a community of similarly equipped, although not necessarily identical, agents. The only environment the agents can interact with is a scene, which is perceived through a number of perception channels. Each channel targets a specific property of an object. The acuity of agents on these channels can vary at will: on each channel, an agent discretises a 0-to-1 range of continuous values into a set of categories, with a granularity of its own. In this way we model the natural diversity within a community. In much the same way, an average human being perceives only 7 important colours in the continuous colour spectrum, while a painter distinguishes 100 nuances, for which she has names.

A game is a specified protocol of interaction between two agents. An example of such a protocol is the "guessing game", where one agent chooses an object in the scene and generates an utterance that describes it, and a second agent must guess the object described by the utterance (without knowing which object was chosen). If the object is correctly indicated, the trust of both agents in the proper usage of the words describing the conceptualisation of the object increases. If the object is not guessed, a repair strategy is applied: the speaker points to the object he has chosen and his trust in the words used to name it decreases, while the hearer either learns the new words or assigns a greater level of trust to this expression as a description of the conceptualisation of the object. After a large number of games of this kind, played in pairs by agents, the community shares a common vocabulary and associates words with categories (concepts) with a high level of trust.

The game develops as follows. The first turn belongs to the speaker. He silently chooses an object, which will be known as the focus, from the objects that are present in the scene and runs a conceptualisation algorithm to find all sets of categories
that can unambiguously distinguish the focus among all the other objects (e.g. red square in a scene in which there is only one red square among other objects). From the list of sets of categories found, a lexicalisation algorithm then selects just one set of categories, the one for which the most appropriate lexical expression can be formed.

The lexicon of the agent is a set of 3-tuples of the form (category, word, confidence). Such a 3-tuple is an association between a category and a word, weighted by a relative confidence factor. Let us note that the correspondence between the conceptual space and the lexicon of an agent can be a many-to-many mapping between categories and words (one word can be ambiguous because it can designate several categories, and a category can be named by a set of synonymous words). When producing an utterance (as a speaker), an agent needs to find the word that best describes a certain category among its set of synonyms and, vice versa, when deciphering an utterance (as a hearer) she has to associate the most plausible concept with an ambiguous word. The confidence factor is used in precisely this situation: the agent scans its lexicon and selects the association with the maximum confidence factor among all the associations with the designated category (in speaking) or word (in hearing).

In order to find the optimal set of words describing the distinctive categories, the lexicalisation algorithm computes a score for each set of discriminating categories found, as the average of the best associations (the words with the highest confidence factors that the agent knows for the chosen categories). Then the set of words with the highest score is selected, subject to a preference for brevity: the winning expression is the shortest one (in number of items) when only one expression of that size exists. When several candidate sets have the same minimum size, the one with the maximum confidence score is chosen. This strategy implements the principle of economy of expression. Only when there is more than one set with the
maximum confidence and the same minimum size is the winning expression decided randomly. The chosen expression is then "spoken", i.e. transferred to the hearer.

The second turn of a multi-game belongs to the hearer, who tries to decompose the transferred string and to find an object in the environment that corresponds to the description. The "heard" string is first tokenised by the hearer into elementary words. Then the agent matches each of these words with the most likely category, by scanning her own lexical memory and selecting, for each word present in the description, the category associated with it with the highest confidence. In the best scenario, the agent matches the words exactly and reproduces the original set of characteristics. Based on this set, the hearer is able to judge the identity of an object in the scene, which, with luck, will be the same as the intended focus in the speaker's mind. In an average simulation, though, some words may be unknown to the hearer and some may be associated with a different category than the one used by the speaker. In any case, the result of this interpretation step is a (possibly empty) set of characteristics. The hearer will now find the objects in the world that possess all these characteristics. When more than one object is a possible target, the agent simply chooses one at random. If the hearer has a plausible candidate, she points to it; otherwise, when the decoding gives an empty set of objects, an "I don't know" kind of message is issued.

Next, if there is an object pointed to by the hearer, it is compared against the object chosen by the speaker (the focus). Based on this comparison, the game can succeed (the object indicated by the hearer is indeed the focus chosen by the speaker) or fail (no object is identified, or the one indicated is the wrong one). The conceptual spaces of both the speaker and the hearer are influenced by the result of the game. In most of the success cases both agents
have the same word-category associations, and their confidence in these associations is increased by a relatively large quota. It can happen, however, that a game ends in success although the agents interpret some of the words used differently. In this case the agents managed to communicate, but only by chance, and the increase of confidence is misleading. It is expected that this erroneous conclusion will be penalised later in other interactions. In the case of failure, the object indicated by the hearer being different from the focus chosen by the speaker, both agents conclude that the words they have used to name categories were inappropriate, and so they decrease the confidence of these word-category pairs. The negative quota of penalty is somewhat lower than the positive quota used in a successful game, such that a successful game outweighs a failed one.

The next step is the learning phase, which takes place only in case of failure. A game can fail because the hearer did not know the words used by the speaker, so the hearer is supposed to improve her word-category associations. The speaker shows the hearer the identity of the intended focus object. With the knowledge of the true object and of the words that describe it, the hearer retraces the operations made by the speaker and computes the list of discriminating sets of categories. As she knows that the number of words equals the number of categories, she retains only the sets that have the same size as the list of words. However, as she does not know the order in which these categories were used, she stores in her internal memory every possible association between the words used and the distinctive categories. The confidence factor for each association is set to the maximum possible confidence divided by the number of associations.

The final step of the game is the purging of the lexicons. During this phase all associations whose confidence has decreased below a minimum threshold are removed
from the agents' memories.

The overall theoretical complexity of the algorithms used for finding the name of an object and for learning is high. Finding the name of an object is exponential in the number of channels. This happens because all subsets of the categories that an object falls into are computed in order to describe it lexically. The learning algorithm is factorial in the number of lexemes used by the speaker. However, this unacceptable theoretical complexity does not harm the running time too much, because the number of perception channels of an agent is fixed and usually small (the current experiments set it at 5). Also, the number of distinctive categories used to identify an object is bounded by the number of perception channels of the agent.

3 The Inception of Pronouns

Anaphora represents the relationship between a term (called "anaphor") and another one (called "antecedent"), when the interpretation of the anaphor is in a certain way determined by the interpretation of the antecedent. When the anaphor refers to the same entity as the antecedent, we say that the anaphor and the antecedent are coreferential. When the surface realisation of the anaphor is that of a pronoun, the coreference relation also fulfils other functions:
- it brings conciseness to the communication, by avoiding direct repetitions of a previous expression, thus contributing to the economy of expression, a central principle in the communication between intelligent agents;
- it maintains the attention focused on a central entity, by referring to it with extremely economical evoking means.
Indeed, only entities which already have a central position in the attention can be referred to by pronouns and, once referred to, their central position is further emphasised.

Anaphora, as a discourse phenomenon, presupposes non-trivial cognitive capacities. The one we are concerned with in this paper is the capacity of memorising the element in focus. This capacity is so central and elementary that we decided to consider it
as being provided by a dedicated "perception" channel (actually a memory channel). Indeed, both distinguishing between right and left (to give a common example of perception) and remembering that a certain object was recently in focus involve primitive cognitive functions. The lack of memory would make a dialogue impossible, in the same way as the lack of spatial perception abilities would make the recognition of spatial relations impossible.

The focussing memory is modelled through a channel called previous-focus, with two values [true, false]. Except for the first utterance of the dialogue, when there is no previously focused entity, on each subsequent utterance there is one entity (object) which is remembered as being the focus of the previous game. As such, each object in the scene has the value false on the previous-focus channel, except for the object which was previously in focus, whose value on this channel is true.

It is clear that modelling the memory of objects previously in focus as a one-place channel (only one object can have the value "true") is a severe simplification, which we accept in this initial shape of the experiments. In reality, human agents are known to be able to record many more already mentioned discourse entities, on which pronouns can afterwards be anchored. The differentiating properties (of pronouns) can include features like animacy, gender and number. Using these properties, as well as the syntactic structure and the discourse structure, a sentence can include more than one pronoun, each referring unambiguously to a different antecedent.

We are not concerned here with modelling the cognitive processes that make the recognition of objects possible. We simply assume either that the agents have the capacity to distinguish the focussed object among the other objects, based on its intrinsic properties, or that the agent eye-tracks the objects from a first position to a second one between games. In
the first case the object would have to be identified again, based on the memorised specific features (the memory channel resembling more a memory cell), while in the second case the identity of the focussed object would have to be maintained rather than regained from memory.

The type of games we are interested in when modelling anaphoric phenomena are multi-turn, such that one entity, which has already been in focus, can be referred to again later. In this study, we are targeting only pronominal anaphors. If we want an agent to develop the ability to use pronouns, the dialogue should include a sequence of utterances in which an entity is mentioned more than once.

4 The Settings

The problem we are concerned with is when and why intelligent agents would develop linguistic abilities for using anaphoric means in communication, and how anaphora could complete a conceptualisation. It is clear that an agent could have at least two reasons for choosing to name an object by a pronoun:
- because it uses fewer words (for instance, "it" instead of "the left red circle");
- because in this way the OLD (that is, the entity previously in focus) is explicitly signalled, maintaining it in focus.
On the other hand, an agent also has at least one reason for not using a pronoun:
- because it could introduce an ambiguity.
The use of pronouns should emerge naturally during the experiments, solely by modelling these contrary tendencies. It should, therefore, not be enforced (given programmatically).

To model the acquisition of pronominal anaphora, four different settings have been used, which we believe present an ascending degree of complexity. All are anchored on a two-turn game. What distinguishes these settings are the changes in the scene of the second turn as compared to the first, as well as the chosen focus.

In the first setting (Figure 1) both games are played in the same scene, and in the second game the co-speaker will focus on the same object that the speaker focussed on in
the first.

Figure 1: Setting 1. Turn 1: A names obj2 by "low left". Turn 2: B names obj2 by "that".

In the second setting (Figure 2) new objects are introduced in the scene of the second game, while the focus remains unchanged.

Figure 2: Setting 2. Turn 1: A names obj2 by "low left". Turn 2: B names obj2 by "that".

In the third setting (Figure 3), the objects in the second game's scene are shuffled (their spatial properties, like horizontal and vertical position, are randomly changed). The co-speaker will keep the focus on the same object, although it might have changed its position.

Figure 3: Setting 3. Turn 1: A names obj2 by "low left". Turn 2: B names obj2 by "that".

In the fourth setting (see Figure 4), the scene of the second game is again a shuffled version of the scene in the first game, and the focus can no longer be identified by any of the attributes used in the first game. In this particular scene, the agents do not distinguish colours or shapes, so the objects can be identified only through position and anaphoric means.

Figure 4: Setting 4. Turn 1: A names obj2 by "left". Turn 2: B names obj2 by "that".

All experiments have been run with the following parameters:
- number of agents: 10;
- number of multi-games: 5000;
- number of objects in the scenes: between 5 and 10;
- channels: "hpos" (horizontal position), "vpos" (vertical position), "color", "shape", "previous-focus";
- channels granularity (number of distinguishable values on channels): from 2 to 4.
A number of 4 values on the "hpos" channel, for instance, could mean: "extreme-left", "left", "right" and "extreme-right". As mentioned, the "previous-focus" channel has 2 values: "true" and "false".

As we see, in every multi-game the focus is maintained on the same object in both turns. Let us notice that it makes no difference who the speaker is in the second game. Only for the sake of displaying a dialogue did we consider the second utterance as produced by the co-speaker.

5 The Results

Figures 5-9 display the average success rates along 10 series of
2500 multi-games, for different configurations of objects and settings. The success rate at a certain moment in time is taken to be the percentage of successful games among the previous 100 games. Figures 5 and 6 show the success rate in setting 1, with scenes containing 5 and, respectively, 8 objects, while Figures 7-9 display the success rate in settings 2-4 when there are 8 objects in the scene. In all experiments, only multi-games which reported success after the first turn have been retained, as we were interested here only in the acquisition of pronouns (mentioned only in the second turn of each multi-game) and not in the stabilisation of a lexicon in general. So, if at the end of the first turn agent B does not recognise the object indicated by agent A, the game is stopped and disregarded.

In all figures, the upper (black) line reports the general success rate (after the second turn), while the lower (gray) line reports the success rate that is due to the use of pronouns. The steeply rising upper lines, in all four settings, show that the agents very quickly acquire a common understanding of the objects in the scene (more precisely, of the object in focus). In general, after 300-400 games, the success rate stabilises at 100%. However, as the lower (gray) lines show, in fewer cases is this common understanding due to the use of pronouns. This should not be interpreted as an indication that the use of pronouns reduces the success rate, but rather that in some cases referential expressions other than pronouns are also used to identify an object which has previously been in focus (for example, in setting 3, Figure 3, B can use "down" instead of "that"). However, if we compare Figures 5 and 6, we see that when the number of objects is larger, the need to use pronouns also goes up. This is clearly due to the fact that a greater agglomeration of objects in the scene makes their identification based on features other than being recently in
focus more ambiguous. Indeed, the agents choose, among the shortest best-known categorisations able to individualise the object in focus unambiguously, which one to use. If all possible utterances have the same length and the same confidence (the confidence of a linguistic expression is calculated as the mean of the confidences of the words used to utter the corresponding categorisation), one of them is chosen randomly. Choosing the shortest utterance is the only bonus that favours the economy of expression and, therefore, the use of pronouns. The graphs show that when there are more objects in the scene, being recently in focus remains the conceptual feature with the highest confidence.

Figure 5: Setting 1, with 5 objects
Figure 6: Setting 1, with 8 objects
Figure 7: Setting 2, with 8 objects
Figure 8: Setting 3, with 8 objects
Figure 9: Setting 4, with 8 objects

An interesting thing is revealed by the graph in Figure 9: the two lines representing the global success rate and the pronoun-based success rate are practically identical. This means that when the situation is very complex, in almost all cases the agents prefer to use a pronoun to identify an already mentioned object.

Finally, we were interested to see what happens when we impose the use of pronouns. Figure 10 shows two lines, both drawn for setting 1: the lower (black) line represents the normal use of pronouns in the case of success in the second turn, while the upper (gray) line represents the success rate in the second turn when we enforced the use of pronouns. The experiment shows that the particular conditions of this setting make it unnecessary to use more than one channel (in this case "previous-focus") to identify the focus. Indeed, in practice, each time a pronoun is used success is guaranteed.

Figure 10: Imposing the use of pronouns, setting 1

6 Conclusions

In this paper we have advocated that the acquisition of pronouns in language can follow an evolutionist
pattern, and that pronouns can therefore appear in language through a natural, spontaneous process, driven by the necessity of the agents to acquire a common understanding of a situation. The study does not show, however, that this is the only way in which pronouns could have appeared in natural languages. It simply shows a possibility. Although the experiment was successful, it is also not realistic to claim that our model of representing anaphoric phenomena and their emergence in a community of artificial agents copies the natural way in which anaphora appeared in languages. We can only try to guess the cause of anaphora's natural inception.

The limits of the experiment are obvious. We have used a paradigm in which a community of agents communicate. A common agreement over a focussed object in a scene is rewarded by an enhancement of the trust in both the conceptualisation used and the linguistic means to express it. After a number of experiments, a certain lexicon is acquired by the community.

The model we used assumes the existence of a memory channel remembering the object recently in focus. When such a channel is open, the identification of an object already mentioned, and which should be mentioned again, can be made more quickly and with less ambiguity, because it involves less categorisation. The linguistic expression of this economical categorisation is the pronoun.

The experiments show a clear tendency of the agents to enhance their linguistic ability to use pronouns in more and more complex contexts. When the number of objects in the scene increases, the chance that the "previous-focus" channel is the only channel that uniquely identifies an object is very high, and therefore the use of pronouns becomes dominant.

In the future it would be interesting to study which semantic features attract the specialisation of pronouns.
Can the categories of male/female, animate/inanimate and singular/plural, as they are used to differentiate pronominal forms in most languages, be generalised? Could a class of experiments intended to bring out the different semantic features of anaphoric expressions be imagined within the limited worlds of the "talking heads"? Another thing that we do not yet know is which levers can be used to restrain the proliferation of lexical forms of pronouns in the community of agents, since in most natural languages there are very few synonyms to express one category of pronouns.

Besides improvements to the framework itself, such as running the simulations on parallel machines to speed up execution, there are also some theoretical questions to investigate. Open problems are the connection between short-term memory and anaphora, the possibility of simulating distinct pronominal categories like the natural-language forms connected with gender (which can be not only masculine or feminine but also "classes of hunting weapons, canines, things that are shiny"), number and others, as well as a resolution algorithm for these cases of ambiguity.

References

1. Steels, L.: The synthetic modeling of language origins. In: Evolution of Communication, 1(1):1-35 (1997)
2. Briscoe, T.: Linguistic evolution through language acquisition: formal and computational models. Cambridge Univ. Press (1999)
3. Steels, L.: Self-organizing vocabularies. In: C. G. Langton (ed.), Proceedings of Alife V (1997)
4. Steels, L.: The Talking Heads Experiment. Volume 1. Words and Meanings. Special pre-edition for LABORATORIUM, Antwerpen (1999)
5. Van Looveren, J.: Design and Performance of Pre-Grammatical Language Games. Ph.D. thesis, Vrije Universiteit Brussel, Brussels (2005)
6. Van Trijp, R.: Analogy and Multi-Level Selection in the Formation of a Case Grammar. A Case Study in Fluid Construction Grammar. Ph.D. thesis, University of Antwerpen
(2008)
7. Steels, L.: Fluid Construction Grammar. Ellezelles Course Notes (draft version). Sony Computer Science Laboratory Paris, Artificial Intelligence Laboratory, VUB, Brussel (2008)
8. Steels, L.: Intelligence with representation. Philosophical Transactions of the Royal Society A, 361, 2381–2395 (2003)
9. Dima, E.: Anaphoric Phenomena In Evolving Lexical Languages. Master thesis, Department of Computer Science, Alexandru Ioan Cuza University, Iasi (2009)
10. Nwana, H. S.: Software Agents: An overview. Cambridge University Press (1996)
11. Lust, B.: Introduction to Studies in the Acquisition of Anaphora. D. Reidel (1986)
12. Boroditsky, L.: How does our language shape the way we think? In: What's Next: Dispatches on the Future of Science. Vintage (2009)
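
As an illustration only, the conceptualisation and lexicalisation steps of section 2, combined with the previous-focus channel of section 3, can be sketched in a few lines of Python. This is not the Uruk implementation (which is written in Java): the representation of objects as sets of "channel:value" strings, the function names, and the sample lexicon with its confidence values are all assumptions made for the example. The selection rule follows one reading of section 2: shortest discriminating set first, then highest mean confidence, then a random choice among the remaining ties.

```python
import random
from itertools import combinations

# Hypothetical lexicon of (category, word, confidence) triples; values are invented.
LEXICON = [
    ("hpos:left", "left", 0.9),
    ("vpos:low", "low", 0.8),
    ("color:red", "red", 0.6),
    ("color:red", "rouge", 0.3),          # a lower-confidence synonym
    ("previous-focus:true", "that", 0.7),  # the candidate pronoun
]

def best_word(lexicon, category):
    """Return the (word, confidence) pair with the highest confidence, or None."""
    entries = [(w, c) for cat, w, c in lexicon if cat == category]
    return max(entries, key=lambda e: e[1]) if entries else None

def discriminating_sets(focus, others):
    """All subsets of the focus's categories that no other object fully shares.
    Enumerating every subset is exponential in the number of channels."""
    cats = sorted(focus)
    found = []
    for size in range(1, len(cats) + 1):
        for subset in combinations(cats, size):
            if not any(set(subset) <= other for other in others):
                found.append(subset)
    return found

def lexicalise(lexicon, candidate_sets, rng=random):
    """Pick the shortest set, then the highest mean confidence, then at random."""
    scored = []
    for cats in candidate_sets:
        words = [best_word(lexicon, c) for c in cats]
        if any(w is None for w in words):
            continue  # the agent knows no word for one of the categories
        mean_conf = sum(c for _, c in words) / len(words)
        scored.append((len(cats), mean_conf, [w for w, _ in words]))
    if not scored:
        return None
    min_size = min(s[0] for s in scored)
    shortest = [s for s in scored if s[0] == min_size]
    top = max(s[1] for s in shortest)
    winners = [s for s in shortest if s[1] == top]
    return rng.choice(winners)[2]

# Second turn: the focus is remembered on the previous-focus channel.
focus = {"hpos:left", "vpos:low", "color:red", "previous-focus:true"}
others = [
    {"hpos:left", "vpos:high", "color:red", "previous-focus:false"},
    {"hpos:right", "vpos:low", "color:red", "previous-focus:false"},
]
with_memory = lexicalise(LEXICON, discriminating_sets(focus, others))

# Same scene without the memory channel (as in a first turn).
strip = lambda obj: {c for c in obj if not c.startswith("previous-focus")}
without_memory = lexicalise(
    LEXICON, discriminating_sets(strip(focus), [strip(o) for o in others])
)
```

In this toy scene the memory channel alone singles out the focus, so the one-word expression wins on length; without it, the agent in the sketch needs two positional words. This mirrors, in miniature, the economy-of-expression argument made in sections 4 and 5.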